Jonathan Christyadi (502705) - AI Core 02
This notebook aims to predict whether a link is a phishing link or a legitimate one, with a focus on exploring and testing hypotheses that warrant further research.
import sklearn
import pandas as pd
import seaborn
import numpy as np
print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pd.__version__)
print("seaborn version:", seaborn.__version__)
scikit-learn version: 1.4.1.post1
pandas version: 2.2.1
seaborn version: 0.13.2
After loading the dataset, I found some inconsistencies in the data. First, the label of the link (phishing or legitimate) can be converted to a binary format. Second, the domain_with_copyright column mixes encodings: some values are binary digits and others are spelled out as words, for example zero, One, etc.
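A minimal sketch of one way to normalize such mixed spellings, using toy values rather than the real column. The names `raw`, `mapping`, and `cleaned` are illustrative assumptions. Lower-casing first keeps the mapping small, and the explicit check surfaces any spelling that was not anticipated instead of silently producing NaN.

```python
import pandas as pd

# Toy values mimicking the mixed encodings described above (not the real data).
raw = pd.Series(["one", "zero", "One", "Zero", "1", "0"])

# Lower-case and strip first so a single small mapping covers every spelling.
mapping = {"one": 1, "zero": 0, "1": 1, "0": 0}
cleaned = raw.str.strip().str.lower().map(mapping)

# Fail loudly if a spelling is not covered, instead of silently keeping NaN.
unmapped = raw[cleaned.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped values: {unmapped}")

cleaned = cleaned.astype(int)
print(cleaned.tolist())
```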
# Raw string so the backslash in the Windows-style path is not treated as an escape sequence
df = pd.read_csv(r"Data\dataset_link_phishing.csv", sep=',', index_col=False, dtype='unicode')
df.head()
| | id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://www.progarchives.com/album.asp?id=61737 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 1 | one | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | phishing |
| 1 | 1 | http://signin.eday.co.uk.ws.edayisapi.dllsign.... | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 300 | 65 | 0 | 0 | 1 | 0 | phishing |
| 2 | 2 | http://www.avevaconstruction.com/blesstool/ima... | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | phishing |
| 3 | 3 | http://www.jp519.com/ | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | legitimate |
| 4 | 4 | https://www.velocidrone.com/ | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | legitimate |
5 rows × 87 columns
df.sample(5)
| | id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15739 | 7738 | https://cteam-my.sharepoint.com/:o:/g/personal... | 126 | 23 | 1 | 2 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 382 | 8018 | 0 | 0 | 1 | 4 | phishing |
| 3077 | 3077 | http://doc.google.share.pressurecookerindia.co... | 150 | 40 | 1 | 5 | 0 | 0 | 1 | 0 | ... | 1 | zero | 0 | 343 | 4405 | 0 | 0 | 1 | 0 | phishing |
| 11363 | 3362 | https://www.sonlight.com/ | 25 | 16 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1379 | 7753 | 140382 | 0 | 0 | 4 | legitimate |
| 13001 | 5000 | https://grabyourcode.com/paypal/adder/index.html | 48 | 16 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 284 | 1541 | 2573053 | 0 | 0 | 0 | phishing |
| 7827 | 7827 | http://www.acostamueble.com/img/ | 32 | 20 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 888 | 5321 | 0 | 0 | 1 | 2 | phishing |
5 rows × 87 columns
columns = df.columns.tolist()
# Write one column name per line so the full list can be inspected outside the notebook
with open("output.txt", "w") as file:
    for column in columns:
        file.write(column + "\n")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19431 entries, 0 to 19430
Data columns (total 85 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   url_length                  19431 non-null  int64
 1   hostname_length             19431 non-null  int64
 2   ip                          19431 non-null  object
 3   total_of.                   19431 non-null  int64
 4   total_of-                   19431 non-null  int64
 5   total_of@                   19431 non-null  object
 6   total_of?                   19431 non-null  int64
 7   total_of&                   19431 non-null  object
 8   total_of=                   19431 non-null  object
 9   total_of_                   19431 non-null  object
 10  total_of~                   19431 non-null  object
 11  total_of%                   19431 non-null  object
 12  total_of/                   19431 non-null  int64
 13  total_of*                   19431 non-null  object
 14  total_of:                   19431 non-null  object
 15  total_of,                   19431 non-null  object
 16  total_of;                   19431 non-null  object
 17  total_of$                   19431 non-null  object
 18  total_of_www                19431 non-null  int64
 19  total_of_com                19431 non-null  object
 20  total_of_http_in_path       19431 non-null  object
 21  https_token                 19431 non-null  object
 22  ratio_digits_url            19431 non-null  float64
 23  ratio_digits_host           19431 non-null  object
 24  punycode                    19431 non-null  object
 25  port                        19431 non-null  object
 26  tld_in_path                 19431 non-null  object
 27  tld_in_subdomain            19431 non-null  object
 28  abnormal_subdomain          19431 non-null  object
 29  nb_subdomains               19431 non-null  object
 30  prefix_suffix               19431 non-null  object
 31  random_domain               19431 non-null  object
 32  shortening_service          19431 non-null  object
 33  path_extension              19431 non-null  object
 34  nb_redirection              19431 non-null  object
 35  nb_external_redirection     19431 non-null  object
 36  length_words_raw            19431 non-null  object
 37  char_repeat                 19431 non-null  object
 38  shortest_words_raw          19431 non-null  object
 39  shortest_word_host          19431 non-null  object
 40  shortest_word_path          19431 non-null  object
 41  longest_words_raw           19431 non-null  object
 42  longest_word_host           19431 non-null  object
 43  longest_word_path           19431 non-null  object
 44  avg_words_raw               19431 non-null  object
 45  avg_word_host               19431 non-null  object
 46  avg_word_path               19431 non-null  object
 47  phish_hints                 19431 non-null  int64
 48  domain_in_brand             19431 non-null  object
 49  brand_in_subdomain          19431 non-null  object
 50  brand_in_path               19431 non-null  object
 51  suspecious_tld              19431 non-null  object
 52  statistical_report          19431 non-null  object
 53  nb_hyperlinks               19431 non-null  int64
 54  ratio_intHyperlinks         19431 non-null  object
 55  ratio_extHyperlinks         19431 non-null  object
 56  ratio_nullHyperlinks        19431 non-null  object
 57  nb_extCSS                   19431 non-null  object
 58  ratio_intRedirection        19431 non-null  object
 59  ratio_extRedirection        19431 non-null  object
 60  ratio_intErrors             19431 non-null  object
 61  ratio_extErrors             19431 non-null  object
 62  login_form                  19431 non-null  object
 63  external_favicon            19431 non-null  object
 64  links_in_tags               19431 non-null  object
 65  submit_email                19431 non-null  object
 66  ratio_intMedia              19431 non-null  object
 67  ratio_extMedia              19431 non-null  object
 68  sfh                         19431 non-null  object
 69  iframe                      19431 non-null  object
 70  popup_window                19431 non-null  object
 71  safe_anchor                 19431 non-null  object
 72  onmouseover                 19431 non-null  object
 73  right_clic                  19431 non-null  object
 74  empty_title                 19431 non-null  object
 75  domain_in_title             19431 non-null  int64
 76  domain_with_copyright       19431 non-null  int32
 77  whois_registered_domain     19431 non-null  object
 78  domain_registration_length  19431 non-null  object
 79  domain_age                  19431 non-null  object
 80  web_traffic                 19431 non-null  object
 81  dns_record                  19431 non-null  object
 82  google_index                19431 non-null  int64
 83  page_rank                   19431 non-null  int64
 84  status                      19431 non-null  int64
dtypes: float64(1), int32(1), int64(13), object(70)
memory usage: 12.5+ MB
# Sampling the dataset
df.sample(10)
| | id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12368 | 4367 | http://bridgeburglar.com/bridge-burglars-guide... | 77 | 17 | 1 | 1 | 6 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 161 | 2761 | 4459552 | 0 | 0 | 1 | legitimate |
| 2116 | 2116 | http://sanangelo.iconcinemas.com/ | 33 | 25 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 135 | 3153 | 9482009 | 0 | 0 | 3 | legitimate |
| 14707 | 6706 | http://nintendo.wikia.com/wiki/Nintendo_Switch | 46 | 18 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 140 | 6070 | 14420 | 0 | 0 | 5 | legitimate |
| 969 | 969 | https://www.justice.gov/atr/blame-switchman-ru... | 90 | 15 | 0 | 2 | 7 | 0 | 0 | 0 | ... | 1 | one | 0 | 0 | -1 | 4382 | 0 | 0 | 6 | legitimate |
| 8546 | 545 | https://www.azurepower.com/ | 27 | 18 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 1119 | 4724 | 942542 | 0 | 0 | 4 | legitimate |
| 2176 | 2176 | https://mail.parkhill.k12.mo.us/owa/auth/logon... | 123 | 23 | 0 | 9 | 0 | 0 | 1 | 1 | ... | 1 | zero | 1 | 0 | -1 | 105946 | 0 | 1 | 4 | phishing |
| 15179 | 7178 | https://login.microsoftonline.com/decee90c-ce0... | 557 | 25 | 1 | 5 | 24 | 0 | 1 | 9 | ... | 1 | 1 | 0 | 350 | 6589 | 30 | 0 | 1 | 4 | legitimate |
| 17788 | 9787 | http://www.payscale.com/research/US/Job=Magnet... | 97 | 16 | 1 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1290 | 7841 | 3990 | 0 | 0 | 5 | legitimate |
| 4291 | 4291 | http://starmak.com.tr/950CAAEA0281AA2BEBED8F9E... | 76 | 14 | 1 | 2 | 0 | 0 | 1 | 0 | ... | 1 | zero | 0 | 0 | 4376 | 0 | 0 | 1 | 1 | phishing |
| 3559 | 3559 | https://s.free.fr/92rsZcB4 | 26 | 9 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 518 | 7800 | 2868149 | 0 | 1 | 5 | phishing |
10 rows × 87 columns
df['status'] = df['status'].map({'phishing': 1, 'legitimate': 0})
df.head()
| | id | url | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | http://www.progarchives.com/album.asp?id=61737 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | ... | 1 | one | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | 1 |
| 1 | 1 | http://signin.eday.co.uk.ws.edayisapi.dllsign.... | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 300 | 65 | 0 | 0 | 1 | 0 | 1 |
| 2 | 2 | http://www.avevaconstruction.com/blesstool/ima... | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | ... | 1 | zero | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | http://www.jp519.com/ | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 1 | one | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | https://www.velocidrone.com/ | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | zero | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | 0 |
5 rows × 87 columns
df['domain_with_copyright'] = df['domain_with_copyright'].map({'one': 1, 'zero': 0, 'Zero': 0, 'One': 1,'1': 1, '0': 0}).astype(int)
df['domain_with_copyright'].unique()
array([1, 0])
# Count the missing values per column (isna and isnull are aliases, so one call is enough)
total_null = df.isnull().sum()
# Total number of missing values in the whole DataFrame
total_null.sum()
0
# Find the columns that contain only the values 0 and 1
def count_binary_columns(df):
    results = []
    counter = 0  # number of columns inspected
    for col in df.columns:
        counter += 1
        if df[col].isin([0, 1]).all():
            results.append(col)
    return results, counter
count_binary_columns(df)
(['domain_in_title', 'domain_with_copyright', 'google_index', 'status'], 85)
df = df.drop(columns=['id', 'url'])
df.head()
| | url_length | hostname_length | ip | total_of. | total_of- | total_of@ | total_of? | total_of& | total_of= | total_of_ | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46 | 20 | 0 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 627 | 6678 | 78526 | 0 | 0 | 5 | 1 |
| 1 | 128 | 120 | 0 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 300 | 65 | 0 | 0 | 1 | 0 | 1 |
| 2 | 52 | 25 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 119 | 1707 | 0 | 0 | 1 | 0 | 1 |
| 3 | 21 | 13 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 130 | 1331 | 0 | 0 | 0 | 0 | 0 |
| 4 | 28 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 164 | 1662 | 312044 | 0 | 0 | 4 | 0 |
5 rows × 85 columns
df['whois_registered_domain'].unique()
array(['0', '1'], dtype=object)
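The output above shows that whois_registered_domain holds the strings '0' and '1', which an `isin([0, 1])` check on the raw column does not match. A minimal sketch of a variant that parses each column as numbers before testing, run on a toy frame; `find_binary_columns` and the toy column names are hypothetical and not part of the notebook.

```python
import pandas as pd

def find_binary_columns(frame):
    """Return columns whose values are all 0 or 1 once parsed as numbers."""
    binary = []
    for col in frame.columns:
        # errors="coerce" turns non-numeric strings into NaN, which fails the check.
        values = pd.to_numeric(frame[col], errors="coerce")
        if values.isin([0, 1]).all():
            binary.append(col)
    return binary

# Toy frame: one flag stored as text, one as integers, plus non-binary columns.
toy = pd.DataFrame({
    "flag_as_text": ["0", "1", "1"],
    "flag_as_int": [1, 0, 1],
    "count": ["3", "0", "1"],
    "url": ["a.com", "b.com", "c.com"],
})
binary_cols = find_binary_columns(toy)
print(binary_cols)
```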
print(df['status'].value_counts())
df['status'].value_counts().plot(kind='bar', title='Count the target variable')
status
0    9716
1    9715
Name: count, dtype: int64
<Axes: title={'center': 'Count the target variable'}, xlabel='status'>
A heatmap is used to select a suitable set of features for predicting the status target. At this stage I had no prior idea which features to use, so I used the heatmap to find the features with the strongest correlation with the target.
import seaborn as sns
import matplotlib.pyplot as plt
corr = df.corr()
plt.figure(figsize=(100, 100))
plot = sns.heatmap(corr, annot=True, fmt='.2f', linewidths=2)
# Sorting the correlation values with the target variable in descending order
corr.drop('status').sort_values(by='status', ascending=False).plot.bar(y='status', title='Correlation with the target variable', figsize=(20, 10))
<Axes: title={'center': 'Correlation with the target variable'}>
# Find the features most correlated with the target variable, using numeric columns only
correlation_matrix = df.corr(numeric_only=True)
sorted_corr = correlation_matrix.sort_values(by='status',ascending=False)
sorted_corr
| | url_length | hostname_length | total_of. | total_of- | total_of? | total_of/ | total_of_www | ratio_digits_url | phish_hints | nb_hyperlinks | domain_in_title | domain_with_copyright | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| status | 0.244348 | 0.240681 | 0.205302 | -0.102849 | 0.293920 | 0.240892 | -0.444561 | 0.356587 | 0.337287 | -0.341295 | 0.339519 | -0.175469 | 0.730684 | -0.509761 | 1.000000 |
| google_index | 0.233061 | 0.216919 | 0.208764 | -0.018285 | 0.202097 | 0.289212 | -0.357215 | 0.323157 | 0.279906 | -0.269482 | 0.265933 | -0.144499 | 1.000000 | -0.386721 | 0.730684 |
| ratio_digits_url | 0.434626 | 0.171761 | 0.224194 | 0.110341 | 0.325739 | 0.206925 | -0.211165 | 1.000000 | 0.096967 | -0.128915 | 0.152393 | -0.027357 | 0.323157 | -0.181489 | 0.356587 |
| domain_in_title | 0.124224 | 0.218850 | 0.108442 | 0.009843 | 0.092191 | 0.088462 | -0.178402 | 0.152393 | 0.125857 | -0.217548 | 1.000000 | 0.076105 | 0.265933 | -0.332742 | 0.339519 |
| phish_hints | 0.332000 | -0.019901 | 0.168765 | 0.065562 | 0.208052 | 0.501321 | -0.090812 | 0.096967 | 1.000000 | -0.112423 | 0.125857 | -0.066130 | 0.279906 | -0.203464 | 0.337287 |
| total_of? | 0.523172 | 0.164129 | 0.353133 | 0.035958 | 1.000000 | 0.243749 | -0.115337 | 0.325739 | 0.208052 | -0.112604 | 0.092191 | -0.046123 | 0.202097 | -0.123151 | 0.293920 |
| url_length | 1.000000 | 0.217586 | 0.447198 | 0.406951 | 0.523172 | 0.486490 | -0.067973 | 0.434626 | 0.332000 | -0.098101 | 0.124224 | -0.004281 | 0.233061 | -0.099900 | 0.244348 |
| total_of/ | 0.486490 | -0.061203 | 0.242216 | 0.204793 | 0.243749 | 1.000000 | -0.005628 | 0.206925 | 0.501321 | -0.073183 | 0.088462 | -0.023213 | 0.289212 | -0.113861 | 0.240892 |
| hostname_length | 0.217586 | 1.000000 | 0.406834 | 0.059480 | 0.164129 | -0.061203 | -0.130991 | 0.171761 | -0.019901 | -0.104614 | 0.218850 | 0.073107 | 0.216919 | -0.160621 | 0.240681 |
| total_of. | 0.447198 | 0.406834 | 1.000000 | 0.049303 | 0.353133 | 0.242216 | 0.068290 | 0.224194 | 0.168765 | -0.093994 | 0.108442 | 0.057320 | 0.208764 | -0.098752 | 0.205302 |
| total_of- | 0.406951 | 0.059480 | 0.049303 | 1.000000 | 0.035958 | 0.204793 | 0.045756 | 0.110341 | 0.065562 | -0.004513 | 0.009843 | 0.020914 | -0.018285 | 0.104676 | -0.102849 |
| domain_with_copyright | -0.004281 | 0.073107 | 0.057320 | 0.020914 | -0.046123 | -0.023213 | 0.087826 | -0.027357 | -0.066130 | 0.192159 | 0.076105 | 1.000000 | -0.144499 | 0.057127 | -0.175469 |
| nb_hyperlinks | -0.098101 | -0.104614 | -0.093994 | -0.004513 | -0.112604 | -0.073183 | 0.114259 | -0.128915 | -0.112423 | 1.000000 | -0.217548 | 0.192159 | -0.269482 | 0.221066 | -0.341295 |
| total_of_www | -0.067973 | -0.130991 | 0.068290 | 0.045756 | -0.115337 | -0.005628 | 1.000000 | -0.211165 | -0.090812 | 0.114259 | -0.178402 | 0.087826 | -0.357215 | 0.110745 | -0.444561 |
| page_rank | -0.099900 | -0.160621 | -0.098752 | 0.104676 | -0.123151 | -0.113861 | 0.110745 | -0.181489 | -0.203464 | 0.221066 | -0.332742 | 0.057127 | -0.386721 | 1.000000 | -0.509761 |
# Get all the correlated features with the target variable
num_features = len(sorted_corr['status']) # 15 rows: 14 features plus status itself
sorted_corr['status'].head(num_features)
status                   1.000000
google_index             0.730684
ratio_digits_url         0.356587
domain_in_title          0.339519
phish_hints              0.337287
total_of?                0.293920
url_length               0.244348
total_of/                0.240892
hostname_length          0.240681
total_of.                0.205302
total_of-               -0.102849
domain_with_copyright   -0.175469
nb_hyperlinks           -0.341295
total_of_www            -0.444561
page_rank               -0.509761
Name: status, dtype: float64
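The ranking above is sorted by signed correlation, so strongly negative features such as page_rank (-0.51) and total_of_www (-0.44) land at the bottom even though they carry more signal than several positive ones. A minimal sketch of ranking by absolute correlation instead, on synthetic data; the column names `strong_pos`, `strong_neg`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: two informative features (one negatively related) and one noise column.
rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "strong_pos": target + rng.normal(0, 0.3, n),
    "strong_neg": -target + rng.normal(0, 0.3, n),
    "noise": rng.normal(0, 1, n),
    "status": target,
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr(numeric_only=True)["status"].drop("status")

# Order by magnitude but keep the sign, so the direction stays visible.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Keeping the sign in the ranked output still shows whether a feature points toward phishing or toward legitimate links.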
# List the features from the previous step into a list
# selected_features = ['google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/','hostname_length','total_of.', 'total_of-','domain_with_copyright','nb_hyperlinks','total_of_www','page_rank']
# Exclude the target itself so that it is not used as a predictor
selected_features = sorted_corr['status'].drop('status').index.tolist()
df[selected_features] = df[selected_features].apply(pd.to_numeric, errors='coerce')
# Check the data types of the selected columns after conversion
print(df[selected_features].dtypes)
# Check if 'status' column exists and has categorical or numerical data
print(df['status'].dtype)
# Create a DataFrame with the selected columns
selected_df = df[selected_features + ['status']]
selected_df.head()
google_index               int64
ratio_digits_url         float64
domain_in_title            int64
phish_hints                int64
total_of?                  int64
url_length                 int64
total_of/                  int64
hostname_length            int64
total_of.                  int64
total_of-                  int64
domain_with_copyright      int32
nb_hyperlinks              int64
total_of_www               int64
page_rank                  int64
dtype: object
int64
| | google_index | ratio_digits_url | domain_in_title | phish_hints | total_of? | url_length | total_of/ | hostname_length | total_of. | total_of- | domain_with_copyright | nb_hyperlinks | total_of_www | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.108696 | 1 | 0 | 1 | 46 | 3 | 20 | 3 | 0 | 1 | 143 | 1 | 5 | 1 |
| 1 | 1 | 0.054688 | 1 | 2 | 0 | 128 | 3 | 120 | 10 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 1 | 0.000000 | 1 | 0 | 0 | 52 | 4 | 25 | 3 | 0 | 0 | 3 | 1 | 0 | 1 |
| 3 | 0 | 0.142857 | 1 | 0 | 0 | 21 | 3 | 13 | 2 | 0 | 1 | 404 | 1 | 0 | 0 |
| 4 | 0 | 0.000000 | 0 | 0 | 0 | 28 | 3 | 19 | 2 | 0 | 0 | 57 | 1 | 4 | 0 |
# Count the number of binary columns in the selected features
features_binary = count_binary_columns(df[selected_features])
features_binary
['status', 'google_index', 'ratio_digits_url', 'domain_in_title', 'phish_hints', 'total_of?', 'url_length', 'total_of/', 'hostname_length', 'total_of.', 'total_of-', 'domain_with_copyright', 'nb_hyperlinks', 'total_of_www', 'page_rank']
from sklearn.preprocessing import StandardScaler
# Scale the data
selected_df = selected_df.dropna()
scaler = StandardScaler()
selected_df[selected_features] = scaler.fit_transform(selected_df[selected_features])
from pandas.plotting import scatter_matrix
scatter_matrix(selected_df, alpha=1, figsize=(60, 60), diagonal='hist')
plt.show()
# Create pairplot
sns.pairplot(selected_df, hue='status', palette='Set1')
# Show the plot
plt.show()
target = 'status'
X = df[selected_features]
y = df[target]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 19431 observations, of which 15544 are now in the train set, and 3887 in the test set.
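The split above draws a different random partition on every run, so the scores that follow will vary slightly between executions. A minimal sketch of a reproducible, class-balanced split on toy arrays; the seed value 42 and the names `X_toy` and `y_toy` are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, two features, perfectly balanced binary labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# random_state fixes the shuffle; stratify keeps the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te), int(y_te.sum()))
```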
# SUPPORT VECTOR MACHINE SVM
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.8422948289169025
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
report = classification_report(y_test, predictions)
print(report)
              precision    recall  f1-score   support

           0       0.87      0.82      0.84      1982
           1       0.82      0.87      0.84      1905

    accuracy                           0.84      3887
   macro avg       0.84      0.84      0.84      3887
weighted avg       0.84      0.84      0.84      3887
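The SVM above is fit on df[selected_features], which is unscaled (the StandardScaler was applied to selected_df only), and an RBF-kernel SVC is sensitive to feature scale. A minimal sketch of scaling inside a pipeline so that the scaler is fit on the training part only, shown on synthetic data from `make_classification`; this is an illustration, not a claim about how the phishing dataset would score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the selected features (assumption: 6 numeric columns).
X_toy, y_toy = make_classification(n_samples=400, n_features=6, random_state=0)
X_toy[:, 0] *= 1000  # exaggerate one feature's scale, similar to url_length vs. ratios

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data and reuses it at prediction time.
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)
svm_accuracy = pipe.score(X_te, y_te)
print("Accuracy:", svm_accuracy)
```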
# LINEAR REGRESSION
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.6897128732856885
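The R² above is not on the same scale as the accuracy reported for the classifiers. One way to make a rough comparison is to threshold the continuous predictions at 0.5 and score the resulting labels. A minimal sketch on synthetic data; the threshold and the toy label rule are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic binary target driven by two of three features.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# Fit on the first 150 rows, evaluate on the remaining 50.
reg = LinearRegression().fit(X_toy[:150], y_toy[:150])
continuous = reg.predict(X_toy[150:])

# Turn continuous outputs into class labels with a 0.5 cut-off.
labels = (continuous >= 0.5).astype(int)
acc = accuracy_score(y_toy[150:], labels)
print("Accuracy after thresholding:", acc)
```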
import shap
# Shap explainer initialized with the model and training data
explainer = shap.Explainer(model, X_train)
# Calculate Shap values for the predictions made on the test set
shap_values = explainer.shap_values(X_test)
# Plot the Shap values using bee swarm plot
shap.summary_plot(shap_values, X_test)
# K-NEAREST NEIGHBORS
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9114998713660921
# DECISION TREE
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Accuracy:", score)
Accuracy: 0.9336249035245691
# class_names must follow ascending class order: 0 = legitimate, 1 = phishing
target_names = ["legitimate", "phishing"]
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
from sklearn.tree import plot_tree
plot_tree(model, fontsize=8, feature_names=selected_features, class_names=target_names)
plt.show()
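Each accuracy above comes from a single random split, so small differences between models may be due to the split itself. A minimal sketch of comparing the same KNN and decision tree settings with 5-fold cross-validation on synthetic data; the fold count and the `make_classification` stand-in are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=600, n_features=8, random_state=0)

# Same hyperparameters as the cells above.
models = {
    "knn": KNeighborsClassifier(n_neighbors=4),
    "tree": DecisionTreeClassifier(min_samples_leaf=40, min_samples_split=300, random_state=0),
}

cv_means = {}
for name, clf in models.items():
    # Five accuracy scores per model, one for each held-out fold.
    scores = cross_val_score(clf, X_toy, y_toy, cv=5)
    cv_means[name] = scores.mean()
    print(name, round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```

Reporting the mean together with the spread across folds makes it easier to judge whether one model is really ahead of another.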